[1] 8
CMOR Lunch’n’Learn
8 March 2023
Ross Wilson
Source
Console
Environment/
History
Files/Plots/
Packages/
Help/Viewer
numeric, character, logical (TRUE and FALSE values only), and integer
complex and raw, but we don’t need to worry about thosetidyversetidyverse’ is a collection of packages created by the company that makes RStudiotibble: a replacement for base data framesreadr: read tabular data like csv files (also readxl for Excel files, haven for SPSS/Stata/SAS, and others for different file types)dplyr: data manipulationtidyr: reshaping and tidying dataggplot2: creating plotspurrr: functional programmingstringr: working with character stringsforcats: working with factor variablesread_csv():# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779.
2 Afghanistan 1957 9240934 Asia 30.3 821.
3 Afghanistan 1962 10267083 Asia 32.0 853.
4 Afghanistan 1967 11537966 Asia 34.0 836.
5 Afghanistan 1972 13079460 Asia 36.1 740.
6 Afghanistan 1977 14880372 Asia 38.4 786.
7 Afghanistan 1982 12881816 Asia 39.9 978.
8 Afghanistan 1987 13867957 Asia 40.8 852.
9 Afghanistan 1992 16317921 Asia 41.7 649.
10 Afghanistan 1997 22227415 Asia 41.8 635.
# ... with 1,694 more rows
dplyrselect() only a subset of variables# A tibble: 1,704 x 3
year country gdpPercap
<dbl> <chr> <dbl>
1 1952 Afghanistan 779.
2 1957 Afghanistan 821.
3 1962 Afghanistan 853.
4 1967 Afghanistan 836.
5 1972 Afghanistan 740.
6 1977 Afghanistan 786.
7 1982 Afghanistan 978.
8 1987 Afghanistan 852.
9 1992 Afghanistan 649.
10 1997 Afghanistan 635.
# ... with 1,694 more rows
dplyrfilter() only a subset of observations# A tibble: 30 x 6
country year pop continent lifeExp gdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Albania 2007 3600523 Europe 76.4 5937.
2 Austria 2007 8199783 Europe 79.8 36126.
3 Belgium 2007 10392226 Europe 79.4 33693.
4 Bosnia and Herzegovina 2007 4552198 Europe 74.9 7446.
5 Bulgaria 2007 7322858 Europe 73.0 10681.
6 Croatia 2007 4493312 Europe 75.7 14619.
7 Czech Republic 2007 10228744 Europe 76.5 22833.
8 Denmark 2007 5468120 Europe 78.3 35278.
9 Finland 2007 5238460 Europe 79.3 33207.
10 France 2007 61083916 Europe 80.7 30470.
# ... with 20 more rows
dplyrmutate() to create new variables# A tibble: 1,704 x 7
country year pop continent lifeExp gdpPercap gdp_billion
<chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779. 6.57
2 Afghanistan 1957 9240934 Asia 30.3 821. 7.59
3 Afghanistan 1962 10267083 Asia 32.0 853. 8.76
4 Afghanistan 1967 11537966 Asia 34.0 836. 9.65
5 Afghanistan 1972 13079460 Asia 36.1 740. 9.68
6 Afghanistan 1977 14880372 Asia 38.4 786. 11.7
7 Afghanistan 1982 12881816 Asia 39.9 978. 12.6
8 Afghanistan 1987 13867957 Asia 40.8 852. 11.8
9 Afghanistan 1992 16317921 Asia 41.7 649. 10.6
10 Afghanistan 1997 22227415 Asia 41.8 635. 14.1
# ... with 1,694 more rows
dplyrsummarise() to calculate summary statistics# A tibble: 1 x 1
mean_gdpPercap
<dbl>
1 7215.
dplyrThe power of dplyr is in combining several commands using ‘pipes’
The previous command could be written:
tidyr package helps us transform our data from one shape to the other
pivot_wider() takes a long dataset and makes it widerpivot_longer() takes a wide dataset and makes it longer# A tibble: 1,704 x 6
country year pop continent lifeExp gdpPercap
<chr> <dbl> <dbl> <chr> <dbl> <dbl>
1 Afghanistan 1952 8425333 Asia 28.8 779.
2 Afghanistan 1957 9240934 Asia 30.3 821.
3 Afghanistan 1962 10267083 Asia 32.0 853.
4 Afghanistan 1967 11537966 Asia 34.0 836.
5 Afghanistan 1972 13079460 Asia 36.1 740.
6 Afghanistan 1977 14880372 Asia 38.4 786.
7 Afghanistan 1982 12881816 Asia 39.9 978.
8 Afghanistan 1987 13867957 Asia 40.8 852.
9 Afghanistan 1992 16317921 Asia 41.7 649.
10 Afghanistan 1997 22227415 Asia 41.8 635.
# ... with 1,694 more rows
gapminder_wide <- gapminder %>%
pivot_wider(id_cols = c(country, continent),
names_from = year, values_from = c(pop, lifeExp, gdpPercap))
gapminder_wide# A tibble: 142 x 38
country continent pop_1952 pop_1957 pop_1962 pop_1967 pop_1972 pop_1977
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghanistan Asia 8425333 9240934 10267083 11537966 13079460 14880372
2 Albania Europe 1282697 1476505 1728137 1984060 2263554 2509048
3 Algeria Africa 9279525 10270856 11000948 12760499 14760787 17152804
4 Angola Africa 4232095 4561361 4826015 5247469 5894858 6162675
5 Argentina Americas 17876956 19610538 21283783 22934225 24779799 26983828
6 Australia Oceania 8691212 9712569 10794968 11872264 13177000 14074100
7 Austria Europe 6927772 6965860 7129864 7376998 7544201 7568430
8 Bahrain Asia 120447 138655 171863 202182 230800 297410
9 Bangladesh Asia 46886859 51365468 56839289 62821884 70759295 80428306
10 Belgium Europe 8730405 8989111 9218400 9556500 9709100 9821800
# ... with 132 more rows, and 30 more variables: pop_1982 <dbl>,
# pop_1987 <dbl>, pop_1992 <dbl>, pop_1997 <dbl>, pop_2002 <dbl>,
# pop_2007 <dbl>, lifeExp_1952 <dbl>, lifeExp_1957 <dbl>, lifeExp_1962 <dbl>,
# lifeExp_1967 <dbl>, lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
# lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
# lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>,
# gdpPercap_1952 <dbl>, gdpPercap_1957 <dbl>, gdpPercap_1962 <dbl>, ...
gapminder_wide %>%
pivot_longer(pop_1952:gdpPercap_2007,
names_to = c(".value", "year"), names_sep = "_")# A tibble: 1,704 x 6
country continent year pop lifeExp gdpPercap
<chr> <chr> <chr> <dbl> <dbl> <dbl>
1 Afghanistan Asia 1952 8425333 28.8 779.
2 Afghanistan Asia 1957 9240934 30.3 821.
3 Afghanistan Asia 1962 10267083 32.0 853.
4 Afghanistan Asia 1967 11537966 34.0 836.
5 Afghanistan Asia 1972 13079460 36.1 740.
6 Afghanistan Asia 1977 14880372 38.4 786.
7 Afghanistan Asia 1982 12881816 39.9 978.
8 Afghanistan Asia 1987 13867957 40.8 852.
9 Afghanistan Asia 1992 16317921 41.7 649.
10 Afghanistan Asia 1997 22227415 41.8 635.
# ... with 1,694 more rows
tidyverse packagesggplot2 in a later sessionpurrr provides tools for functional programming
purrr, we can use map() (and similar) to apply a function to multiple inputs and extract all of the outputsstringr and forcats are worth looking at if you need to work with string or factor variables – we won’t cover them hereR has evolved over time to fit a variety of different needs and use cases
This gives it great flexibility and ability to meet the needs of different users
But, the diversity of interfaces, data structures, implementation, and fitted model objects can be a challenge
We’ll cover some tools to help bridge that gap
A few common (but not universal) features:
Models are described by a formula: e.g. y ~ x + z
Data are provided in a data frame (or, equivalently, tibble)
coef(), vcov(), summary() can be used to extract the coefficient estimates, variance covariance matrix, and to print a summary of the fitted model
broom packagebroom provides several functions to convert fitted model objects to tidy tibbles
The package works with several model fitting functions from base R and commonly-used packages
Functions:
tidy(): construct a tibble that summarises the statistical findings (coefficients, p-values, etc.)augment(): add new columns to the original data (predictions/fitted values, etc.)glance(): construct a one-row summary of the model (goodness-of-fit, etc.)Linear regression models can be fitted with the lm function (in the stats package, part of base R)
We’ll start by loading the gapminder dataset from the previous session:
?lm to find the documentationlinear_regression_model is now a fitted model object
If we want, we can look at how this object is actually stored:
List of 12
$ coefficients : Named num [1:2] 5.40e+01 7.65e-04
..- attr(*, "names")= chr [1:2] "(Intercept)" "gdpPercap"
$ residuals : Named num [1:1704] -25.8 -24.3 -22.6 -20.6 -18.4 ...
..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
$ effects : Named num [1:1704] -2455.1 311.1 -21.6 -19.6 -17.5 ...
..- attr(*, "names")= chr [1:1704] "(Intercept)" "gdpPercap" "" "" ...
$ rank : int 2
$ fitted.values: Named num [1:1704] 54.6 54.6 54.6 54.6 54.5 ...
..- attr(*, "names")= chr [1:1704] "1" "2" "3" "4" ...
$ assign : int [1:2] 0 1
$ qr :List of 5
..$ qr : num [1:1704, 1:2] -41.2795 0.0242 0.0242 0.0242 0.0242 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:1704] "1" "2" "3" "4" ...
.. .. ..$ : chr [1:2] "(Intercept)" "gdpPercap"
.. ..- attr(*, "assign")= int [1:2] 0 1
..$ qraux: num [1:2] 1.02 1.02
..$ pivot: int [1:2] 1 2
..$ tol : num 1e-07
..$ rank : int 2
..- attr(*, "class")= chr "qr"
$ df.residual : int 1702
$ xlevels : Named list()
$ call : language lm(formula = lifeExp ~ gdpPercap, data = gapminder)
$ terms :Classes 'terms', 'formula' language lifeExp ~ gdpPercap
.. ..- attr(*, "variables")= language list(lifeExp, gdpPercap)
.. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. ..- attr(*, "dimnames")=List of 2
.. .. .. ..$ : chr [1:2] "lifeExp" "gdpPercap"
.. .. .. ..$ : chr "gdpPercap"
.. ..- attr(*, "term.labels")= chr "gdpPercap"
.. ..- attr(*, "order")= int 1
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(lifeExp, gdpPercap)
.. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "gdpPercap"
$ model :'data.frame': 1704 obs. of 2 variables:
..$ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
..$ gdpPercap: num [1:1704] 779 821 853 836 740 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language lifeExp ~ gdpPercap
.. .. ..- attr(*, "variables")= language list(lifeExp, gdpPercap)
.. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
.. .. .. ..- attr(*, "dimnames")=List of 2
.. .. .. .. ..$ : chr [1:2] "lifeExp" "gdpPercap"
.. .. .. .. ..$ : chr "gdpPercap"
.. .. ..- attr(*, "term.labels")= chr "gdpPercap"
.. .. ..- attr(*, "order")= int 1
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(lifeExp, gdpPercap)
.. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
.. .. .. ..- attr(*, "names")= chr [1:2] "lifeExp" "gdpPercap"
- attr(*, "class")= chr "lm"
Call:
lm(formula = lifeExp ~ gdpPercap, data = gapminder)
Residuals:
Min 1Q Median 3Q Max
-82.754 -7.758 2.176 8.225 18.426
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.396e+01 3.150e-01 171.29 <2e-16 ***
gdpPercap 7.649e-04 2.579e-05 29.66 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.49 on 1702 degrees of freedom
Multiple R-squared: 0.3407, Adjusted R-squared: 0.3403
F-statistic: 879.6 on 1 and 1702 DF, p-value: < 2.2e-16
tidy() from the broom package will put them into a nice tidy tibble# A tibble: 2 x 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 54.0 0.315 171. 0
2 gdpPercap 0.000765 0.0000258 29.7 3.57e-156
glance() gives us several overall model statistics# A tibble: 1 x 12
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.341 0.340 10.5 880. 3.57e-156 1 -6422. 12850. 12867.
# ... with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
ggplot2ggplot2 is the most well-developed and widely usedgapminder dataset from the previous sessionggplot2ggplot objectKeeping track of changes to data and code (and being able to revert to a previous version if things go wrong) is critical for reproducible research
This is particularly true when collaborating with others
The best way to do this is with a version control system such as Git
What not to put under version control
Can be as simple as something like:
Shows in what order to run the scripts, and allows us to resume from the middle (if, for example, you have only changed the file 02-create-histogram.R, there is no need to redo the first two steps, but we do need to rerun the create-histogram and render-report steps
For more complicated (or long-running) analyses, we may want to explicitly specify dependencies and let the computer figure out how to get everything up-to-date
Two good tools for doing this:
Advantages of an automated pipeline like this:
Make is a system tool, designed for use in software development, to specify targets, commands, and dependencies between files and selectively re-run commands when dependencies change
A Makefile is a plain text file specifying a list of these targets (intermediate/output files in the analysis workflow), commands (to create the targets), and dependencies (input files needed for each command)
words.txt: /usr/share/dict/words
cp /usr/share/dict/words words.txt
histogram.tsv: histogram.r words.txt
Rscript $<
histogram.png: histogram.tsv
Rscript -e 'library(ggplot2); qplot(Length, Freq, data=read.delim("$<")); ggsave("$@")'
report.html: report.rmd histogram.tsv histogram.png
Rscript -e 'rmarkdown::render("$<")'targets is an R package designed to do something very similar, but specifically designed for R projectsplan <- list(
tar_file(words, download_words()),
tar_target(frequency_table, summarise_word_lengths(words)),
tar_target(histogram, create_histogram(frequency_table)),
tar_file(report, render_report("reports/report.rmd", frequency_table, histogram))
)(if you are already writing
script files for your analyses)
(will get better
with practice)
drake package, which was a predecessor package of targets. The same concepts all applyAdvantages:
Advantages: